Abstract

In this paper, we study a special bandit setting of online stochastic linear optimization,
where only one bit of information is revealed to the learner at each round. This problem
has found many applications, including online advertisement and online recommendation.
We assume the binary feedback is a random variable generated from the logit model, and
aim to minimize the regret defined by the unknown linear function. Although the existing
method for generalized linear bandits can be applied to our problem, its high computational
cost makes it impractical for real-world problems. To address this challenge, we develop
an efficient online learning algorithm by exploiting particular structures of the observation
model. Specifically, we adopt the online Newton step to estimate the unknown parameter and
derive a tight confidence region based on the exponential concavity of the logistic loss. Our
analysis shows that the proposed algorithm achieves a regret bound of $\tilde{O}(d\sqrt{T})$, which
matches the optimal result of stochastic linear bandits.

Keywords: bandit, online, regret bound, stochastic linear optimization, logit model 


1. Introduction 


Online learning with bandit feedback plays an important role in several industrial domains,
such as ad placement, website optimization, and packet routing (Bubeck and Cesa-Bianchi,
2012). A canonical framework for studying this problem is the multi-armed bandits (MAB),
which models the situation that a gambler must choose which of $K$ slot machines to
play (Robbins, 1952). In the basic stochastic MAB, each arm is assumed to deliver rewards
that are drawn from a fixed but unknown distribution. The goal of the gambler is
to minimize the regret, namely the difference between his expected cumulative reward and
that of the best single arm in hindsight (Auer et al., 2002). Although MAB is a powerful
framework for modeling online decision problems, it becomes intractable when the number
of arms is very large or even infinite. To address this challenge, various algorithms
have been designed to exploit different structural properties of the reward function, such as
Lipschitz (Kleinberg et al., 2008) and convex (Flaxman et al., 2005; Agarwal et al., 2013).
Among them, stochastic linear bandits (SLB) has received considerable attention during
the past decade (Auer, 2002; Dani et al., 2008a; Abbasi-Yadkori et al., 2011). In each round
of SLB, the learner is asked to choose an action $\mathbf{x}_t$ from a decision set $\mathcal{D} \subseteq \mathbb{R}^d$, then he
observes $y_t$ such that

$$\mathbb{E}[y_t \mid \mathbf{x}_t] = \mathbf{x}_t^\top \mathbf{w}_*, \tag{1}$$

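where $\mathbf{w}_* \in \mathbb{R}^d$ is a vector of unknown parameters. The goal of the learner is to minimize the
(pseudo) regret

$$T \max_{\mathbf{x} \in \mathcal{D}} \mathbf{x}^\top \mathbf{w}_* - \sum_{t=1}^{T} \mathbf{x}_t^\top \mathbf{w}_*. \tag{2}$$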


In this paper, we consider a special bandit setting of online linear optimization where the
feedback $y_t$ only contains one bit of information. In particular, $y_t \in \{\pm 1\}$. Our setting is
motivated by the fact that in many real-world applications, such as online advertising and
recommender systems, user feedback (e.g., click or not) is usually binary. Since the feedback
is binary-valued, we assume it is generated according to the logit model (Hastie et al., 2009),
i.e.,

$$\Pr[y_t = \pm 1 \mid \mathbf{x}_t] = \frac{1}{1 + \exp(-y_t \mathbf{x}_t^\top \mathbf{w}_*)} = \frac{\exp(y_t \mathbf{x}_t^\top \mathbf{w}_*)}{1 + \exp(y_t \mathbf{x}_t^\top \mathbf{w}_*)}. \tag{3}$$
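To make the observation model in (3) concrete, the following is a minimal Python sketch of how one-bit feedback could be simulated. The function names, the parameter vector `w_star`, and the action `x` are hypothetical choices made for illustration; they are not part of the paper.

```python
import numpy as np

def prob_positive(x, w_star):
    """Pr[y = +1 | x] under the logit model (3)."""
    return 1.0 / (1.0 + np.exp(-(x @ w_star)))

def sample_feedback(x, w_star, rng):
    """Draw one-bit feedback y in {+1, -1} for the submitted action x."""
    return 1 if rng.random() < prob_positive(x, w_star) else -1

rng = np.random.default_rng(0)
w_star = np.array([0.8, -0.3])   # hypothetical unknown parameter
x = np.array([0.6, 0.2])         # hypothetical action inside the unit ball
ys = [sample_feedback(x, w_star, rng) for _ in range(5)]
```

Since $\Pr[y_t = -1 \mid \mathbf{x}_t]$ equals the logistic function evaluated at $-\mathbf{x}_t^\top \mathbf{w}_*$, the two outcome probabilities sum to one, which is what the sampler relies on.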

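Without loss of generality, suppose 1 is the preferred outcome. Then, it is natural to define
the regret in terms of the expected number of times that 1 is observed, i.e.,

$$T \max_{\mathbf{x} \in \mathcal{D}} \frac{\exp(\mathbf{x}^\top \mathbf{w}_*)}{1 + \exp(\mathbf{x}^\top \mathbf{w}_*)} - \sum_{t=1}^{T} \frac{\exp(\mathbf{x}_t^\top \mathbf{w}_*)}{1 + \exp(\mathbf{x}_t^\top \mathbf{w}_*)}. \tag{4}$$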


The observation model in (3) and the nonlinear regret in (4) can be treated as a special
case of the Generalized Linear Bandit (GLB) (Filippi et al., 2010). However, the existing
algorithm for GLB is inefficient in the sense that: i) it is not a truly online algorithm since
the whole learning history is stored in memory and used to estimate $\mathbf{w}_*$; and ii) it is limited
to the case that the number of arms is finite because an upper bound for each arm needs
to be calculated explicitly in each round.

The main contribution of this paper is an efficient online learning algorithm that effectively
exploits particular structures of the logit model. Based on the analytical properties
of the logistic function, we first show that the linear regret defined in (2) and the nonlinear
regret in (4) differ only by a constant factor, and then focus on minimizing the former
due to its simplicity. Similar to previous studies (Bubeck and Cesa-Bianchi, 2012), we follow
the principle of “optimism in face of uncertainty” to deal with the exploration-exploitation
dilemma. The basic idea is to maintain a confidence region for $\mathbf{w}_*$, and choose an estimate
from the confidence region and an action so that the linear reward is maximized. Thus,
the problem reduces to the construction of the confidence region from one-bit feedback that
satisfies (3). Based on the exponential concavity of the logistic loss, we propose to use a
variant of the online Newton step (Hazan et al., 2007) to find the center of the confidence
region and derive its width by a rather technical analysis of the updating rule. Theoretical
analysis shows that our algorithm achieves a regret bound of $\tilde{O}(d\sqrt{T})$,$^1$ which matches the
result for SLB (Dani et al., 2008a). Furthermore, we provide several strategies to reduce
the computational cost of the proposed algorithm.

1. We use the $\tilde{O}$ notation to hide constant factors as well as polylogarithmic factors in $d$ and $T$.


2. Related Work 


The stochastic multi-armed bandits (MAB) (Robbins, 1952) has become the canonical
formalism for studying the problem of decision-making under uncertainty. A long line of
successive problems has been extensively studied in statistics (Berry and Fristedt, 1985)
and computer science (Bubeck and Cesa-Bianchi, 2012).


2.1 Stochastic Multi-armed Bandits (MAB) 

In their seminal paper, Lai and Robbins (1985) establish an asymptotic lower bound of
$O(K \log T)$ for the expected cumulative regret over $T$ periods, under the assumption that
the expected rewards of the best and second best arms are well-separated. By making
use of upper confidence bounds (UCB), they further construct policies which achieve the
lower bound asymptotically. However, this initial algorithm is quite involved, because the
computation of UCB relies on the entire sequence of rewards obtained so far. To address this
limitation, Agrawal (1995) introduces a family of simpler policies that only need to calculate
the sample mean of rewards, and the regret retains the optimal logarithmic behavior. A
finite-time analysis of stochastic MAB is conducted by Auer et al. (2002). In particular, they
propose a UCB-type algorithm based on the Chernoff-Hoeffding bound, and demonstrate
that it achieves the optimal logarithmic regret uniformly over time.


2.2 Stochastic Linear Bandits (SLB) 

SLB is first studied by Auer (2002), who considers the case that $\mathcal{D}$ is finite. Although an
elegant UCB-type algorithm named LinRel is developed, he fails to bound its regret due to
independence issues. Instead, he designs a complicated master algorithm which uses LinRel
as a subroutine, and achieves a regret bound of $\tilde{O}((\log |\mathcal{D}|)^{3/2} \sqrt{Td})$, where $|\mathcal{D}|$ is the number
of feasible decisions. In a subsequent work, Dani et al. (2008a) generalize LinRel slightly so
that it can be applied in settings where $\mathcal{D}$ may be infinite. They refer to the new algorithm
as ConfidenceBall$_2$, and show it enjoys a bound of $\tilde{O}(d\sqrt{T})$, which does not depend on
the cardinality of $\mathcal{D}$. Later, Abbasi-Yadkori et al. (2011) improve the theoretical analysis
of ConfidenceBall$_2$ by employing tools from the self-normalized processes. Specifically, the
worst-case bound is improved by a logarithmic factor and the constant is improved.


2.3 ConfidenceBall$_2$


To facilitate comparisons, we give a brief description of the ConfidenceBall$_2$ algorithm (Dani et al.,
2008a). In each round, the algorithm maintains a confidence region $C_t$ such that with a high
probability $\mathbf{w}_* \in C_t$. Then, the algorithm finds the greedy optimistic decision

$$\mathbf{x}_t = \operatorname*{argmax}_{\mathbf{x} \in \mathcal{D}} \max_{\mathbf{w} \in C_t} \mathbf{x}^\top \mathbf{w}.$$


After submitting $\mathbf{x}_t$ to the oracle, the algorithm receives $y_t$ that satisfies (1). Given the
past action-feedback pairs $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_t, y_t)$, the confidence region $C_{t+1}$ is constructed as
follows. The center of $C_{t+1}$ is found by minimizing the $\ell_2$-regularized square loss, i.e.,

$$\mathbf{w}_{t+1} = \operatorname*{argmin}_{\mathbf{w}} \sum_{i=1}^{t} (\mathbf{x}_i^\top \mathbf{w} - y_i)^2 + \lambda \|\mathbf{w}\|_2^2.$$

Notice that $\mathbf{w}_{t+1}$ can be computed efficiently in an online fashion. Let $A_{t+1} = \lambda I + \sum_{i=1}^{t} \mathbf{x}_i \mathbf{x}_i^\top$.
Based on the self-normalized bound for vector-valued martingales (Abbasi-Yadkori et al.,
2011), the width of $C_{t+1}$ can be characterized by

$$(\mathbf{w} - \mathbf{w}_{t+1})^\top A_{t+1} (\mathbf{w} - \mathbf{w}_{t+1}) \le \delta_{t+1}$$

for some constant $\delta_{t+1} > 0$. As can be seen, the above procedure for constructing the
confidence region is specially designed for the observation model in (1), and thus cannot be
applied to the model in (3).
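The remark that the regularized least-squares center can be computed in an online fashion can be illustrated with a short Python sketch. It assumes the standard Sherman-Morrison rank-one update for the inverse of $A_{t+1}$; the class name `OnlineRidge` and the synthetic data are hypothetical and only serve as an illustration.

```python
import numpy as np

class OnlineRidge:
    """Maintains the minimizer of sum_i (x_i^T w - y_i)^2 + lam * ||w||^2 incrementally."""

    def __init__(self, d, lam):
        self.A_inv = np.eye(d) / lam   # inverse of A_1 = lam * I
        self.b = np.zeros(d)           # running sum of y_i * x_i

    def update(self, x, y):
        # Sherman-Morrison: (A + x x^T)^{-1} = A^{-1} - A^{-1} x x^T A^{-1} / (1 + x^T A^{-1} x)
        Ax = self.A_inv @ x
        self.A_inv -= np.outer(Ax, Ax) / (1.0 + x @ Ax)
        self.b += y * x

    def center(self):
        return self.A_inv @ self.b

rng = np.random.default_rng(0)
d, lam = 3, 1.0
X = rng.normal(size=(20, d))           # synthetic actions
y = rng.normal(size=20)                # synthetic real-valued feedback
model = OnlineRidge(d, lam)
for xi, yi in zip(X, y):
    model.update(xi, yi)
batch = np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)
```

Each update costs $O(d^2)$ time and memory, independent of the number of rounds, and the incremental estimate agrees with the batch solution up to floating-point error.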

2.4 Generalized Linear Bandit (GLB) 

Filippi et al. (2010) extend SLB to the nonlinear case based on the Generalized Linear Model
framework of statistics. In the so-called GLB model, $y_t$ is assumed to satisfy $\mathbb{E}[y_t \mid \mathbf{x}_t] =
\mu(\mathbf{x}_t^\top \mathbf{w}_*)$, where $\mu : \mathbb{R} \mapsto \mathbb{R}$ is a certain link function. The regret is also defined in terms of
$\mu(\cdot)$ and given by

$$T \max_{\mathbf{x} \in \mathcal{D}} \mu(\mathbf{x}^\top \mathbf{w}_*) - \sum_{t=1}^{T} \mu(\mathbf{x}_t^\top \mathbf{w}_*). \tag{5}$$

Note that by setting $\mu(x) = \exp(x)/[1 + \exp(x)]$, the problem considered in this paper
becomes a special case of GLB. A UCB-type algorithm has been proposed for GLB and
also achieves a regret bound of $\tilde{O}(d\sqrt{T})$. Different from ConfidenceBall$_2$, which constructs
a confidence region in the parameter space, the algorithm of Filippi et al. (2010) operates
only in the reward space. However, the space and time complexities of that algorithm in
the $t$-th iteration are $O(t)$ and $O(t + |\mathcal{D}|)$, respectively. The $O(t)$ factor comes from the fact
that it needs to store the past action-feedback pairs $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_{t-1}, y_{t-1})$ and use all of them
to estimate $\mathbf{w}_*$. The $O(|\mathcal{D}|)$ factor is due to the fact that it needs to calculate an upper bound
for each arm in order to decide the next action $\mathbf{x}_t$.

2.5 Adversarial Setting 

All the results mentioned above are under the stochastic setting, where the reward of each
arm is generated from an unknown but fixed distribution. A more general setting is the
adversarial case, in which the reward from each arm may change arbitrarily (Bubeck and Cesa-Bianchi,
2012). The most well-known method for the adversarial multi-armed bandits is the Exp3
algorithm, which achieves a regret bound of $O(\sqrt{KT})$ (Auer et al., 2003). The problem
of adversarial linear bandits has been extensively studied, and the state-of-the-art regret
bound is $O(\mathrm{poly}(d)\sqrt{T})$ (Dani et al., 2008b; Abernethy et al., 2008; Bubeck et al., 2012).
For more results, please refer to Bubeck and Cesa-Bianchi (2012), Shamir (2013), and
references therein.
2.6 Bandit Learning with One-bit Feedback

There are several new variants of bandit learning that also rely on one-bit feedback,
such as multi-class bandits (Kakade et al., 2008; Chen et al., 2014) and $K$-armed dueling
bandits (Yue et al., 2009; Ailon et al., 2014). For example, in multi-class bandits, the
feedback is whether the predicted label is correct or not, and in $K$-armed dueling bandits, the
feedback is the comparison between the rewards from two arms. However, none of them is
designed for online linear optimization.


2.7 One-bit Compressive Sensing (CS) 

Finally, we would like to discuss one closely related work in signal processing: one-bit
Compressive Sensing (CS) (Boufounos and Baraniuk, 2008; Plan and Vershynin, 2010). One-bit
CS aims to recover a sparse vector $\mathbf{w}_*$ from a set of one-bit measurements $\{y_i\}$, where $y_i$ is
generated from $\mathbf{w}_*$ according to a certain observation model such as (3). The main
difference is that one-bit CS is studied in the batch setting with the goal of minimizing the recovery
error, while our problem is studied in the online setting with the goal of minimizing the regret.


3. Online Learning for Logit Model (OL$^2$M)

We first describe the proposed algorithm for online stochastic linear optimization given one- 
bit feedback, next compare it with existing methods, then state its theoretical guarantees, 
and finally discuss implementation issues. 


3.1 The Algorithm 

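For a positive definite matrix $A \in \mathbb{R}^{d \times d}$, the weighted $\ell_2$-norm is defined by $\|\mathbf{x}\|_A^2 = \mathbf{x}^\top A \mathbf{x}$.
Without loss of generality, we assume the decision space $\mathcal{D}$ is contained in the unit ball,
that is,

$$\|\mathbf{x}\|_2 \le 1, \quad \forall \mathbf{x} \in \mathcal{D}. \tag{6}$$

We further assume the $\ell_2$-norm of $\mathbf{w}_*$ is upper bounded by some constant $R$, which is known
to the learner. Our first observation is that the linear regret in (2) and the nonlinear regret
in (4) differ only by a constant factor, as indicated below.

Lemma 1 We have

$$\frac{1}{2(1 + \exp(R))} \left( T \max_{\mathbf{x} \in \mathcal{D}} \mathbf{x}^\top \mathbf{w}_* - \sum_{t=1}^{T} \mathbf{x}_t^\top \mathbf{w}_* \right) \le T \max_{\mathbf{x} \in \mathcal{D}} \frac{\exp(\mathbf{x}^\top \mathbf{w}_*)}{1 + \exp(\mathbf{x}^\top \mathbf{w}_*)} - \sum_{t=1}^{T} \frac{\exp(\mathbf{x}_t^\top \mathbf{w}_*)}{1 + \exp(\mathbf{x}_t^\top \mathbf{w}_*)} \le \frac{1}{4} \left( T \max_{\mathbf{x} \in \mathcal{D}} \mathbf{x}^\top \mathbf{w}_* - \sum_{t=1}^{T} \mathbf{x}_t^\top \mathbf{w}_* \right). \tag{7}$$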


In the following, we will develop an efficient algorithm that minimizes the linear regret, 
which in turn minimizes the nonlinear regret as well. 
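The constants in Lemma 1 come from bounding the slope of the logistic function on $[-R, R]$. The following Python sketch checks this numerically for one hypothetical value of $R$; it is a sanity check under that assumption, not a proof.

```python
import numpy as np

def mu(z):
    """Logistic function exp(z) / (1 + exp(z))."""
    return 1.0 / (1.0 + np.exp(-z))

R = 2.0                                   # hypothetical bound on ||w_*||_2
lower = 1.0 / (2.0 * (1.0 + np.exp(R)))   # left constant in Lemma 1

# The derivative of the logistic function stays within [lower, 1/4] on [-R, R].
z = np.linspace(-R, R, 2001)
slope = mu(z) * (1.0 - mu(z))
slope_ok = bool(np.all(slope >= lower) and np.all(slope <= 0.25))

# Hence any reward gap mu(b) - mu(a) is sandwiched by the linear gap b - a.
rng = np.random.default_rng(0)
a, b = np.sort(rng.uniform(-R, R, size=(1000, 2)), axis=1).T
gap = mu(b) - mu(a)
sandwich_ok = bool(np.all(gap >= lower * (b - a) - 1e-12)
                   and np.all(gap <= 0.25 * (b - a) + 1e-12))
```

Summing the per-round sandwich with $b = \mathbf{x}_*^\top \mathbf{w}_*$ and $a = \mathbf{x}_t^\top \mathbf{w}_*$ over $t$ gives the two-sided relation between the linear and nonlinear regret.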

The algorithm is motivated as follows. Suppose actions $\mathbf{x}_1, \ldots, \mathbf{x}_t$ have been submitted
to the oracle, and let $y_1, \ldots, y_t$ be the one-bit feedback from the oracle. To approximate
$\mathbf{w}_*$, the most straightforward way is to find the maximum likelihood estimator by solving
the following logistic regression problem

$$\min_{\|\mathbf{w}\|_2 \le R} \frac{1}{t} \sum_{i=1}^{t} \log\left(1 + \exp(-y_i \mathbf{x}_i^\top \mathbf{w})\right).$$
Algorithm 1 Online Learning for Logit Model (OL$^2$M)
1: Input: Step size $\eta$, regularization parameter $\lambda$
2: $Z_1 = \lambda I$, $\mathbf{w}_1 = \mathbf{0}$
3: for $t = 1, 2, \ldots$ do
4:    $(\mathbf{x}_t, \tilde{\mathbf{w}}_t) = \operatorname{argmax}_{\mathbf{x} \in \mathcal{D},\, \mathbf{w} \in C_t} \mathbf{x}^\top \mathbf{w}$
5:    Submit $\mathbf{x}_t$ and observe $y_t \in \{\pm 1\}$
6:    Solve the optimization problem in (8) to find $\mathbf{w}_{t+1}$
7: end for


However, this approach does not scale well since it requires the learner to store the entire
learning history. Instead, we propose an online algorithm to find an approximate solution.
The key observation is that the logistic loss

$$f_t(\mathbf{w}) = \log\left(1 + \exp(-y_t \mathbf{x}_t^\top \mathbf{w})\right)$$

is exponentially concave over a bounded domain (Hazan et al., 2014), which motivates us to
apply a variant of the online Newton step (Hazan et al., 2007). Specifically, we propose to
find an approximate solution $\mathbf{w}_{t+1}$ by solving the following problem

$$\min_{\|\mathbf{w}\|_2 \le R} \frac{\|\mathbf{w} - \mathbf{w}_t\|_{Z_{t+1}}^2}{2} + \eta (\mathbf{w} - \mathbf{w}_t)^\top \nabla f_t(\mathbf{w}_t) \tag{8}$$

where $\eta > 0$ is the step size,

$$Z_{t+1} = Z_t + \frac{\eta\beta}{2} \mathbf{x}_t \mathbf{x}_t^\top, \tag{9}$$

and $\beta$ is defined in (14). Although our updating rule is similar to the method in Hazan et al.
(2007), there also exist some differences. As indicated by (9), in our case $\mathbf{x}_t \mathbf{x}_t^\top$ is used to
approximate the Hessian matrix, while in Hazan et al. (2007) $\nabla f_t(\mathbf{w}_t)[\nabla f_t(\mathbf{w}_t)]^\top$ is used.
After a theoretical analysis, we are able to show that with a high probability

$$\mathbf{w}_* \in C_{t+1} = \left\{ \mathbf{w} : \|\mathbf{w} - \mathbf{w}_{t+1}\|_{Z_{t+1}} \le \sqrt{\gamma_{t+1}} \right\} \tag{10}$$

where the value of $\gamma_{t+1}$ is given in (12). Given the confidence region, we adopt the principle
of “optimism in face of uncertainty”, and the next action $\mathbf{x}_{t+1}$ is given by

$$(\mathbf{x}_{t+1}, \tilde{\mathbf{w}}_{t+1}) = \operatorname*{argmax}_{\mathbf{x} \in \mathcal{D},\, \mathbf{w} \in C_{t+1}} \mathbf{x}^\top \mathbf{w}. \tag{11}$$
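For intuition on the joint maximization in (11), consider the special case of a finite set of arms (an assumption made only for this illustration; the paper allows infinite decision sets). For a fixed $\mathbf{x}$, the inner maximum over the ellipsoid has the standard closed form $\mathbf{x}^\top \mathbf{w}_{t+1} + \sqrt{\gamma_{t+1}} \|\mathbf{x}\|_{Z_{t+1}^{-1}}$, so the optimistic action can be found by scoring each arm. The function name and the numbers below are hypothetical.

```python
import numpy as np

def optimistic_action(arms, w_hat, Z, gamma):
    """Solve max over (x, w) of x^T w with x in `arms` and ||w - w_hat||_Z <= sqrt(gamma).

    Assumes every arm is a nonzero vector.
    """
    Z_inv = np.linalg.inv(Z)
    widths = np.sqrt(np.einsum("ij,jk,ik->i", arms, Z_inv, arms))  # ||x||_{Z^{-1}} per arm
    ucb = arms @ w_hat + np.sqrt(gamma) * widths                   # inner max in closed form
    k = int(np.argmax(ucb))
    x = arms[k]
    w_tilde = w_hat + np.sqrt(gamma) * (Z_inv @ x) / widths[k]     # maximizer inside the ellipsoid
    return x, w_tilde, ucb[k]

arms = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])   # hypothetical decision set
w_hat = np.array([0.2, 0.1])                            # hypothetical center of the region
Z = np.diag([4.0, 1.0])                                 # hypothetical matrix Z
x, w_tilde, value = optimistic_action(arms, w_hat, Z, gamma=1.0)
```

In this toy instance the second arm is selected even though its estimated reward is the smallest, because the confidence region is widest along that direction.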


At the beginning, we set

$$Z_1 = \lambda I, \quad \text{and} \quad \mathbf{w}_1 = \mathbf{0}.$$

The above procedure is summarized in Algorithm 1 and is referred to as Online Learning for
Logit Model (OL$^2$M).
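The update in (8) and (9) can be sketched in Python as follows. Completing the square shows that (8) is a projection of the Newton-type step $\mathbf{w}_t - \eta Z_{t+1}^{-1} \nabla f_t(\mathbf{w}_t)$ onto the ball of radius $R$ in the $Z_{t+1}$-norm. The way the projection is computed here (an eigendecomposition plus bisection on the Lagrange multiplier) is an implementation choice of this sketch, not something prescribed by the paper, and the function name is hypothetical.

```python
import numpy as np

def ons_update(w, Z, x, y, eta, beta, R):
    """One step of (8)-(9); returns (w_next, Z_next)."""
    Z_next = Z + 0.5 * eta * beta * np.outer(x, x)       # recursion (9)
    grad = -y * x / (1.0 + np.exp(y * (x @ w)))          # gradient of log(1 + exp(-y x^T w))
    u = w - eta * np.linalg.solve(Z_next, grad)          # unconstrained minimizer of (8)
    if np.linalg.norm(u) <= R:
        return u, Z_next
    # Projection in the Z_next-norm: w(m) = (Z_next + m I)^{-1} Z_next u for the
    # multiplier m >= 0 with ||w(m)||_2 = R; ||w(m)||_2 is decreasing in m.
    evals, Q = np.linalg.eigh(Z_next)
    v = Q.T @ u
    norm_at = lambda m: np.linalg.norm(evals / (evals + m) * v)
    lo, hi = 0.0, evals.max() * (np.linalg.norm(u) / R - 1.0)   # norm_at(hi) <= R
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if norm_at(mid) > R else (lo, mid)
    return Q @ (evals / (evals + hi) * v), Z_next

R, eta = 1.0, 1.0
beta = 1.0 / (2.0 * (1.0 + np.exp(R)))                   # value from (14)
w1, Z2 = ons_update(np.zeros(3), 0.1 * np.eye(3), np.array([1.0, 0.0, 0.0]), 1, eta, beta, R)
```

Only $\mathbf{w}_t$ and $Z_t$ are carried between rounds, so the memory footprint is $O(d^2)$ regardless of how many rounds have elapsed.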

Since both ConfidenceBall$_2$ (Dani et al., 2008a) and our OL$^2$M are UCB-type algorithms,
their overall frameworks are similar. The main difference lies in the construction
of the confidence region and the related analysis. While ConfidenceBall$_2$ uses online least
squares to update the center of the confidence region, OL$^2$M resorts to the online Newton step.
Due to the difference in the updating rule and the observation model, the self-normalized
bound for vector-valued martingales (Abbasi-Yadkori et al., 2011) cannot be applied here.

Although our observation model in (3) can be handled by the Generalized Linear Bandit
(GLB) (Filippi et al., 2010), this paper differs from GLB in the following aspects.

• To estimate $\mathbf{w}_*$, GLB needs to store the learning history and perform batch updating
in each round. In contrast, the proposed OL$^2$M performs online updating.

• While GLB only considers a finite number of arms, we allow the number of arms to
be infinite.

• Our algorithm follows the learning framework of SLB. Thus, existing techniques for
speeding up SLB can also be used to accelerate our algorithm, which is discussed in
Section 3.3.


3.2 Theoretical Guarantees 

The main theoretical contribution of this paper is the following theorem regarding the 
confidence region of w* at each round. 

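Theorem 1 With a probability at least $1 - \delta$, we have

$$\|\mathbf{w}_{t+1} - \mathbf{w}_*\|_{Z_{t+1}} \le \sqrt{\gamma_{t+1}}, \quad \forall t > 0$$

where

$$\gamma_{t+1} = 2\eta \left[ 4R + \left( \frac{4}{\beta} + \frac{8}{3} R \right) \tau_t + \frac{1}{\beta} \log \frac{\det(Z_{t+1})}{\det(Z_1)} \right] + \max(\cdots), \tag{12}$$

$$\tau_t = \log \frac{2 \lceil 2 \log_2 t \rceil t^2}{\delta}, \tag{13}$$

$$\beta = \frac{1}{2(1 + \exp(R))}. \tag{14}$$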

The main idea is to analyze the growth of $\|\mathbf{w}_{t+1} - \mathbf{w}_*\|_{Z_{t+1}}^2$ by exploring the properties of the
logistic loss (Lemmas 2 and 4) and concentration inequalities for martingales (Lemma 5).
By a simple upper bound of $\log \det(Z_{t+1})/\det(Z_1)$, we can show that the width of the
confidence region is $O(\sqrt{d \log t})$.


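Corollary 2 We have

$$\log \frac{\det(Z_{t+1})}{\det(Z_1)} \le d \log\left(1 + \frac{\eta\beta t}{2\lambda d}\right)$$

and thus

$$\gamma_{t+1} \le O(d \log t), \quad \forall t > 0.$$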
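The determinant bound behind Corollary 2, namely $\log \det(Z_{t+1})/\det(Z_1) \le d \log(1 + \eta\beta t/(2\lambda d))$ as derived in Appendix C, can be checked numerically. The sketch below uses randomly generated actions and hypothetical values of $d$, $t$, $\lambda$, $\eta$, and $R$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, t, lam, eta, R = 5, 200, 1.0, 1.0, 1.0        # hypothetical problem sizes and parameters
beta = 1.0 / (2.0 * (1.0 + np.exp(R)))           # value from (14)

Z = lam * np.eye(d)                              # Z_1 = lam * I
for _ in range(t):
    x = rng.normal(size=d)
    x /= max(1.0, np.linalg.norm(x))             # keep ||x||_2 <= 1 as in (6)
    Z += 0.5 * eta * beta * np.outer(x, x)       # recursion (9)

log_ratio = np.linalg.slogdet(Z)[1] - d * np.log(lam)        # log det(Z_{t+1}) / det(Z_1)
bound = d * np.log(1.0 + eta * beta * t / (2.0 * lam * d))   # right-hand side of the bound
```

The bound grows only logarithmically in $t$, which is why the squared radius of the confidence region is $O(d \log t)$.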

Based on Theorem 1, we have the following regret bound for OL$^2$M.

Theorem 3 With a probability at least $1 - \delta$, we have

$$T \max_{\mathbf{x} \in \mathcal{D}} \mathbf{x}^\top \mathbf{w}_* - \sum_{t=1}^{T} \mathbf{x}_t^\top \mathbf{w}_* \le 4 \sqrt{\frac{\gamma_T T}{\eta\beta} \log \frac{\det(Z_{T+1})}{\det(Z_1)}}$$

holds for all $T > 0$.

Combining with the upper bound in Corollary 2, the above theorem implies our algorithm
achieves a regret bound of $\tilde{O}(d\sqrt{T})$, which matches the bound for Stochastic Linear
Bandits (Dani et al., 2008a).


3.3 Implementation Issues 


The main computational cost of OL$^2$M comes from (11), which is NP-hard in general (Dani et al.,
2008a). In the following, we discuss several strategies for reducing the computational cost.

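Optimization Over Ball As mentioned by Dani et al. (2008a), in the special case that
$\mathcal{D}$ is the unit ball, (11) could be solved in time $O(\mathrm{poly}(d))$. Here, we provide an explanation
using techniques from convex optimization. To this end, we rewrite the optimization
problem in (11) as follows

$$\max_{\|\mathbf{x}\|_2 \le 1,\ \|\mathbf{w} - \mathbf{w}_{t+1}\|_{Z_{t+1}}^2 \le \gamma_{t+1}} \mathbf{x}^\top \mathbf{w} = \max_{\|\mathbf{w} - \mathbf{w}_{t+1}\|_{Z_{t+1}}^2 \le \gamma_{t+1}} \|\mathbf{w}\|_2$$

which is equivalent to

$$\min_{\|\mathbf{w} - \mathbf{w}_{t+1}\|_{Z_{t+1}}^2 \le \gamma_{t+1}} -\mathbf{w}^\top \mathbf{w}.$$

The above problem is an optimization problem with a quadratic objective and one quadratic
inequality constraint. It is well known that strong duality holds provided there exists a
strictly feasible point (Boyd and Vandenberghe, 2004). Thus, we can solve its dual problem,
which is convex and given by

$$\begin{aligned}
\max \quad & \gamma \\
\text{s.t.} \quad & \lambda \ge 0, \\
& \begin{bmatrix} -I + \lambda Z_{t+1} & -\lambda Z_{t+1} \mathbf{w}_{t+1} \\ -\lambda \mathbf{w}_{t+1}^\top Z_{t+1} & \lambda\left(\|\mathbf{w}_{t+1}\|_{Z_{t+1}}^2 - \gamma_{t+1}\right) - \gamma \end{bmatrix} \succeq 0.
\end{aligned}$$

After obtaining the dual solution, we can get the primal solution based on KKT conditions.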
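Enlarging the Confidence Region For a positive definite matrix $A \in \mathbb{R}^{d \times d}$, we define

$$\|\mathbf{x}\|_{1,A} = \|A^{1/2} \mathbf{x}\|_1.$$

When studying SLB, Dani et al. (2008a) propose to enlarge the confidence region from
$C_{t+1} = \{\mathbf{w} : \|\mathbf{w} - \mathbf{w}_{t+1}\|_{Z_{t+1}} \le \sqrt{\gamma_{t+1}}\}$ to $\tilde{C}_{t+1} = \{\mathbf{w} : \|\mathbf{w} - \mathbf{w}_{t+1}\|_{1,Z_{t+1}} \le \sqrt{d\gamma_{t+1}}\}$ such
that the computational cost could be reduced. This idea can be directly incorporated into our
OL$^2$M. Let $\mathcal{E}_{t+1}$ be the set of extremal points of $\tilde{C}_{t+1}$. With this modification, (11) becomes

$$(\mathbf{x}_{t+1}, \tilde{\mathbf{w}}_{t+1}) = \operatorname*{argmax}_{\mathbf{x} \in \mathcal{D},\, \mathbf{w} \in \tilde{C}_{t+1}} \mathbf{x}^\top \mathbf{w} = \operatorname*{argmax}_{\mathbf{x} \in \mathcal{D},\, \mathbf{w} \in \mathcal{E}_{t+1}} \mathbf{x}^\top \mathbf{w}$$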
Algorithm 2 OL$^2$M with Lazy Updating
1: Input: Step size $\eta$, regularization parameter $\lambda$, constant $c$
2: $Z_1 = \lambda I$, $\mathbf{w}_1 = \mathbf{0}$, $\tau = 1$
3: for $t = 1, 2, \ldots$ do
4:    if $\det(Z_t) > (1 + c) \det(Z_\tau)$ then
5:       $(\mathbf{x}_t, \tilde{\mathbf{w}}_t) = \operatorname{argmax}_{\mathbf{x} \in \mathcal{D},\, \mathbf{w} \in C_t} \mathbf{x}^\top \mathbf{w}$
6:       $\tau = t$
7:    end if
8:    $\mathbf{x}_t = \mathbf{x}_\tau$
9:    Submit $\mathbf{x}_t$ and observe $y_t \in \{\pm 1\}$
10:   Solve the optimization problem in (8) to find $\mathbf{w}_{t+1}$
11: end for

which means we just need to enumerate over the $2d$ vertices in $\mathcal{E}_{t+1}$. Following the arguments
in Dani et al. (2008a), it is straightforward to show that the regret is only increased by a
factor of $\sqrt{d}$.

Lazy Updating Abbasi-Yadkori et al. (2011) propose a lazy updating strategy which
only needs to solve (11) $O(\log T)$ times. The key idea is to recompute $\mathbf{x}_t$ whenever $\det(Z_t)$
increases by a constant factor $(1 + c)$. While the computational cost is reduced dramatically, the
regret is only increased by a constant factor $\sqrt{1 + c}$. We provide the lazy updating version
of OL$^2$M in Algorithm 2.
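The determinant test that drives lazy updating can be tracked cheaply, as the following Python sketch suggests. It assumes the matrix determinant lemma and the Sherman-Morrison formula for the rank-one recursion (9), and it forces a recomputation in the first round so that an action is always available (an assumption of this sketch). The function name is hypothetical, the actions are random rather than chosen optimistically, and only the number of times (11) would be re-solved is counted.

```python
import numpy as np

def count_recomputations(xs, lam, eta, beta, c):
    """Count how often the lazy rule would re-solve (11) along the action sequence xs."""
    d = xs.shape[1]
    a = 0.5 * eta * beta                    # weight of the rank-one term in (9)
    Z_inv = np.eye(d) / lam
    log_det = d * np.log(lam)               # log det(Z_1)
    log_det_last = -np.inf                  # forces a recomputation at t = 1
    recomputed = 0
    for x in xs:
        if log_det > np.log1p(c) + log_det_last:   # det(Z_t) > (1 + c) det(Z_tau)
            recomputed += 1                 # here the arg max in (11) would be solved again
            log_det_last = log_det
        Zx = Z_inv @ x
        s = 1.0 + a * (x @ Zx)
        log_det += np.log(s)                # det(Z + a x x^T) = det(Z) (1 + a x^T Z^{-1} x)
        Z_inv -= a * np.outer(Zx, Zx) / s   # Sherman-Morrison update of the inverse
    return recomputed, log_det

rng = np.random.default_rng(0)
xs = rng.normal(size=(500, 4))
xs /= np.maximum(1.0, np.linalg.norm(xs, axis=1, keepdims=True))   # ||x||_2 <= 1
beta = 1.0 / (2.0 * (1.0 + np.exp(1.0)))    # (14) with the hypothetical choice R = 1
recomputed, log_det = count_recomputations(xs, lam=1.0, eta=1.0, beta=beta, c=0.5)
```

Because every recomputation after the first requires the determinant to grow by a factor of at least $1 + c$, the count is at most $1 + \log(\det(Z_{T+1})/\det(Z_1))/\log(1 + c)$, which is logarithmic in $T$ by Corollary 2.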


4. Analysis 

We here present the proofs of the main theorems. The omitted proofs are provided in the
appendix.

4.1 Proof of Theorem 1

We begin with several lemmas that are central to our analysis.

Although the application of the online Newton step (Hazan et al., 2007) in Algorithm 1
is motivated by the fact that $f_t(\mathbf{w})$ is exponentially concave over a bounded domain, our
analysis is built upon a related but different property that the logistic loss $\log(1 + \exp(x))$
is strongly convex over a bounded domain, from which we obtain the following lemma.

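Lemma 2 Denote the ball of radius $R$ by $\mathcal{B}_R$, i.e., $\mathcal{B}_R = \{\mathbf{w} : \|\mathbf{w}\|_2 \le R\}$. The following
holds for $\beta \le \frac{1}{2(1 + \exp(R))}$:

$$f_t(\mathbf{w}_2) \ge f_t(\mathbf{w}_1) + [\nabla f_t(\mathbf{w}_1)]^\top (\mathbf{w}_2 - \mathbf{w}_1) + \frac{\beta}{2} \left( (\mathbf{w}_2 - \mathbf{w}_1)^\top \mathbf{x}_t \right)^2, \quad \forall \mathbf{w}_1, \mathbf{w}_2 \in \mathcal{B}_R.$$

Comparing Lemma 2 with Lemma 3 in Hazan et al. (2007), we can see that the quadratic
term in our inequality does not depend on $y_t$. This independence allows us to simplify the
subsequent analysis involving martingales.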
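The inequality of Lemma 2 can be probed numerically with the following Python sketch, which samples random points in the ball and records the smallest slack between the two sides. The dimension, the radius, and the helper names are hypothetical; a nonnegative slack on random samples is only a sanity check.

```python
import numpy as np

def loss(w, x, y):
    """Logistic loss f_t(w) = log(1 + exp(-y x^T w))."""
    return np.log1p(np.exp(-y * (x @ w)))

def grad(w, x, y):
    """Gradient of the logistic loss with respect to w."""
    return -y * x / (1.0 + np.exp(y * (x @ w)))

def random_in_ball(rng, d, radius):
    """A random point with Euclidean norm at most `radius`."""
    v = rng.normal(size=d)
    return radius * rng.random() * v / np.linalg.norm(v)

rng = np.random.default_rng(0)
d, R = 4, 1.5                                    # hypothetical dimension and radius
beta = 1.0 / (2.0 * (1.0 + np.exp(R)))
min_slack = np.inf
for _ in range(2000):
    x = random_in_ball(rng, d, 1.0)              # action with ||x||_2 <= 1
    y = rng.choice([-1, 1])
    w1, w2 = random_in_ball(rng, d, R), random_in_ball(rng, d, R)
    rhs = (loss(w1, x, y) + grad(w1, x, y) @ (w2 - w1)
           + 0.5 * beta * ((w2 - w1) @ x) ** 2)
    min_slack = min(min_slack, loss(w2, x, y) - rhs)
```

The quadratic term involves only $\mathbf{x}_t$ and the two points, not the label $y_t$, which matches the remark that the curvature term is independent of the feedback.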
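Our second lemma is devoted to analyzing the property of the updating rule in (8).

Lemma 3

$$\langle \mathbf{w}_t - \mathbf{w}_*, \nabla f_t(\mathbf{w}_t) \rangle \le \frac{\|\mathbf{w}_t - \mathbf{w}_*\|_{Z_{t+1}}^2}{2\eta} - \frac{\|\mathbf{w}_{t+1} - \mathbf{w}_*\|_{Z_{t+1}}^2}{2\eta} + \frac{\eta}{2} \|\nabla f_t(\mathbf{w}_t)\|_{Z_{t+1}^{-1}}^2. \tag{15}$$

For each function $f_t(\cdot)$, we denote its conditional expectation over $y_t$ by $\bar{f}_t(\cdot)$, i.e.,

$$\bar{f}_t(\mathbf{w}) = \mathbb{E}_{y_t}\left[ \log\left(1 + \exp(-y_t \mathbf{x}_t^\top \mathbf{w})\right) \right]. \tag{16}$$

Based on the property of the Kullback-Leibler divergence (Cover and Thomas, 2006), we obtain
the following lemma.

Lemma 4 We have

$$\bar{f}_t(\mathbf{w}) \ge \bar{f}_t(\mathbf{w}_*), \quad \forall \mathbf{w} \in \mathbb{R}^d.$$

Next, we introduce one inequality for bounding the weighted $\ell_2$-norm of the gradient

$$\|\nabla f_t(\mathbf{w})\|_A^2 = \left( \frac{\exp(-y_t \mathbf{x}_t^\top \mathbf{w})}{1 + \exp(-y_t \mathbf{x}_t^\top \mathbf{w})} \right)^2 \mathbf{x}_t^\top A \mathbf{x}_t \le \|\mathbf{x}_t\|_A^2, \quad \forall A \succeq 0,\ \mathbf{w} \in \mathbb{R}^d. \tag{17}$$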

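We continue the proof of Theorem 1 in the following. Our updating rule in (8) ensures
$\|\mathbf{w}_t\|_2 \le R$, $\forall t > 0$. Combining with the assumption $\|\mathbf{w}_*\|_2 \le R$, Lemma 2 implies

$$f_t(\mathbf{w}_t) \le f_t(\mathbf{w}_*) + [\nabla f_t(\mathbf{w}_t)]^\top (\mathbf{w}_t - \mathbf{w}_*) - \frac{\beta}{2} \left( (\mathbf{w}_* - \mathbf{w}_t)^\top \mathbf{x}_t \right)^2. \tag{18}$$

By taking expectation over $y_t$, (18) becomes

$$\bar{f}_t(\mathbf{w}_t) \le \bar{f}_t(\mathbf{w}_*) + [\nabla \bar{f}_t(\mathbf{w}_t)]^\top (\mathbf{w}_t - \mathbf{w}_*) - \frac{\beta}{2} \left( (\mathbf{w}_* - \mathbf{w}_t)^\top \mathbf{x}_t \right)^2.$$

Combining with Lemma 4, and writing $a_t := ((\mathbf{w}_* - \mathbf{w}_t)^\top \mathbf{x}_t)^2$, $b_t := [\nabla \bar{f}_t(\mathbf{w}_t) - \nabla f_t(\mathbf{w}_t)]^\top (\mathbf{w}_t - \mathbf{w}_*)$,
and $c_t := \|\mathbf{x}_t\|_{Z_{t+1}^{-1}}^2$, we have

$$\begin{aligned}
0 &\le [\nabla \bar{f}_t(\mathbf{w}_t)]^\top (\mathbf{w}_t - \mathbf{w}_*) - \frac{\beta}{2} a_t \\
&= [\nabla f_t(\mathbf{w}_t)]^\top (\mathbf{w}_t - \mathbf{w}_*) - \frac{\beta}{2} a_t + b_t \\
&\overset{(15)}{\le} \frac{\|\mathbf{w}_t - \mathbf{w}_*\|_{Z_{t+1}}^2}{2\eta} - \frac{\|\mathbf{w}_{t+1} - \mathbf{w}_*\|_{Z_{t+1}}^2}{2\eta} + \frac{\eta}{2} \|\nabla f_t(\mathbf{w}_t)\|_{Z_{t+1}^{-1}}^2 - \frac{\beta}{2} a_t + b_t \\
&\overset{(17)}{\le} \frac{\|\mathbf{w}_t - \mathbf{w}_*\|_{Z_{t+1}}^2}{2\eta} - \frac{\|\mathbf{w}_{t+1} - \mathbf{w}_*\|_{Z_{t+1}}^2}{2\eta} + \frac{\eta}{2} c_t - \frac{\beta}{2} a_t + b_t \\
&\overset{(9)}{=} -\frac{\|\mathbf{w}_{t+1} - \mathbf{w}_*\|_{Z_{t+1}}^2}{2\eta} - \frac{\beta}{2} a_t + b_t + \frac{\eta}{2} c_t + \frac{\|\mathbf{w}_t - \mathbf{w}_*\|_{Z_t}^2}{2\eta} + \frac{\beta}{4} \left( \mathbf{x}_t^\top (\mathbf{w}_t - \mathbf{w}_*) \right)^2 \\
&= -\frac{\|\mathbf{w}_{t+1} - \mathbf{w}_*\|_{Z_{t+1}}^2}{2\eta} - \frac{\beta}{4} a_t + b_t + \frac{\eta}{2} c_t + \frac{\|\mathbf{w}_t - \mathbf{w}_*\|_{Z_t}^2}{2\eta}.
\end{aligned}$$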
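We thus have

$$\|\mathbf{w}_{t+1} - \mathbf{w}_*\|_{Z_{t+1}}^2 \le \|\mathbf{w}_t - \mathbf{w}_*\|_{Z_t}^2 - \frac{\eta\beta}{2} a_t + 2\eta b_t + \eta^2 c_t.$$

Summing the above inequality over iterations 1 to $t$, we obtain

$$\|\mathbf{w}_{t+1} - \mathbf{w}_*\|_{Z_{t+1}}^2 + \frac{\eta\beta}{2} \sum_{i=1}^{t} a_i \le \lambda R^2 + 2\eta \sum_{i=1}^{t} b_i + \eta^2 \sum_{i=1}^{t} c_i. \tag{19}$$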
Next, we discuss how to bound the summation of the martingale difference sequence $\sum_{i=1}^{t} b_i$.
To this end, we prove the following lemma, which is built upon Bernstein’s inequality
for martingales (Cesa-Bianchi and Lugosi, 2006) and the peeling technique (Bartlett et al.,
2005).

Lemma 5 With a probability at least $1 - \delta$, we have

$$\sum_{i=1}^{t} b_i \le 4R + 2\sqrt{\tau_t \sum_{i=1}^{t} a_i} + \frac{8}{3} R \tau_t, \quad \forall t > 0$$

where $\tau_t$ is defined in (13).

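From Lemma 5 and the basic inequality

$$2\sqrt{\tau_t \sum_{i=1}^{t} a_i} \le \frac{\beta}{4} \sum_{i=1}^{t} a_i + \frac{4}{\beta} \tau_t,$$

with a probability at least $1 - \delta$, we have

$$\sum_{i=1}^{t} b_i \le 4R + \frac{\beta}{4} \sum_{i=1}^{t} a_i + \left( \frac{4}{\beta} + \frac{8}{3} R \right) \tau_t \tag{20}$$

holds for all $t > 0$. Substituting (20) into (19), we obtain

$$\|\mathbf{w}_{t+1} - \mathbf{w}_*\|_{Z_{t+1}}^2 \le \lambda R^2 + 2\eta \left[ 4R + \left( \frac{4}{\beta} + \frac{8}{3} R \right) \tau_t \right] + \eta^2 \sum_{i=1}^{t} c_i. \tag{21}$$

Finally, we show an upper bound for $\sum_{i=1}^{t} c_i$, which is a direct consequence of Lemma
12 in Hazan et al. (2007).

Lemma 6 We have

$$\sum_{i=1}^{t} \|\mathbf{x}_i\|_{Z_{i+1}^{-1}}^2 \le \frac{2}{\eta\beta} \log \frac{\det(Z_{t+1})}{\det(Z_1)}.$$

We complete the proof by combining (21) with the above lemma.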
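4.2 Proof of Lemma 2

We first show that the one-dimensional logistic loss $\ell(x) = \log(1 + \exp(-x))$ is $\frac{1}{2(1 + \exp(R))}$-strongly
convex over the domain $[-R, R]$. It is easy to verify that $\forall x \in [-R, R]$,

$$\ell''(x) = \frac{\exp(x)}{(1 + \exp(x))^2} \ge \frac{1}{2(1 + \exp(R))}$$

implying the strong convexity of $\ell(\cdot)$. From the property of strong convexity, for any
$a, b \in [-R, R]$ we have

$$\ell(b) \ge \ell(a) + \ell'(a)(b - a) + \frac{\beta}{2}(b - a)^2. \tag{22}$$

Notice that for any $\mathbf{w}_1, \mathbf{w}_2 \in \mathcal{B}_R$, we have

$$y_t \mathbf{x}_t^\top \mathbf{w}_1,\ y_t \mathbf{x}_t^\top \mathbf{w}_2 \in [-R, R],$$

since $y_t \in \{\pm 1\}$ and $\|\mathbf{x}_t\|_2 \le 1$. Substituting $a = y_t \mathbf{x}_t^\top \mathbf{w}_1$ and $b = y_t \mathbf{x}_t^\top \mathbf{w}_2$ into (22), we
have

$$\ell(y_t \mathbf{x}_t^\top \mathbf{w}_2) \ge \ell(y_t \mathbf{x}_t^\top \mathbf{w}_1) + \ell'(y_t \mathbf{x}_t^\top \mathbf{w}_1)(y_t \mathbf{x}_t^\top \mathbf{w}_2 - y_t \mathbf{x}_t^\top \mathbf{w}_1) + \frac{\beta}{2}(y_t \mathbf{x}_t^\top \mathbf{w}_2 - y_t \mathbf{x}_t^\top \mathbf{w}_1)^2.$$

We complete the proof by noticing

$$f_t(\mathbf{w}_1) = \ell(y_t \mathbf{x}_t^\top \mathbf{w}_1), \quad f_t(\mathbf{w}_2) = \ell(y_t \mathbf{x}_t^\top \mathbf{w}_2), \quad \text{and} \quad \nabla f_t(\mathbf{w}_1) = \ell'(y_t \mathbf{x}_t^\top \mathbf{w}_1) y_t \mathbf{x}_t.$$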


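4.3 Proof of Lemma 3

Lemma 3 follows from a more general result stated below.

Lemma 7 Let $M$ be a positive definite matrix, and

$$\mathbf{y} = \operatorname*{argmin}_{\mathbf{w} \in \mathcal{W}} \eta \langle \mathbf{w}, \mathbf{g} \rangle + \frac{1}{2} \|\mathbf{w} - \mathbf{x}\|_M^2$$

where $\mathcal{W}$ is a convex set. Then for all $\mathbf{w} \in \mathcal{W}$, we have

$$\langle \mathbf{x} - \mathbf{w}, \mathbf{g} \rangle \le \frac{\|\mathbf{x} - \mathbf{w}\|_M^2 - \|\mathbf{y} - \mathbf{w}\|_M^2}{2\eta} + \frac{\eta}{2} \|\mathbf{g}\|_{M^{-1}}^2.$$

Proof Since $\mathbf{y}$ is the optimal solution to the optimization problem, from the first-order
optimality condition (Boyd and Vandenberghe, 2004), we have

$$\langle \eta \mathbf{g} + M(\mathbf{y} - \mathbf{x}), \mathbf{w} - \mathbf{y} \rangle \ge 0, \quad \forall \mathbf{w} \in \mathcal{W}. \tag{23}$$

Based on the above inequality, we have

$$\begin{aligned}
\|\mathbf{x} - \mathbf{w}\|_M^2 - \|\mathbf{y} - \mathbf{w}\|_M^2 &= \mathbf{x}^\top M \mathbf{x} - \mathbf{y}^\top M \mathbf{y} + 2 \langle M(\mathbf{y} - \mathbf{x}), \mathbf{w} \rangle \\
&\overset{(23)}{\ge} \mathbf{x}^\top M \mathbf{x} - \mathbf{y}^\top M \mathbf{y} + 2 \langle M(\mathbf{y} - \mathbf{x}), \mathbf{y} \rangle - 2 \langle \eta \mathbf{g}, \mathbf{w} - \mathbf{y} \rangle \\
&= \|\mathbf{y} - \mathbf{x}\|_M^2 + 2 \langle \eta \mathbf{g}, \mathbf{y} - \mathbf{x} + \mathbf{x} - \mathbf{w} \rangle \\
&= 2 \langle \eta \mathbf{g}, \mathbf{x} - \mathbf{w} \rangle + \|\mathbf{y} - \mathbf{x}\|_M^2 + 2 \langle \eta \mathbf{g}, \mathbf{y} - \mathbf{x} \rangle.
\end{aligned}$$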

Online Stochastic Linear Optimization under One-bit Feedback 


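Combining with the following inequality

$$\|\mathbf{y} - \mathbf{x}\|_M^2 + 2 \langle \eta \mathbf{g}, \mathbf{y} - \mathbf{x} \rangle \ge \min_{\mathbf{w}} \|\mathbf{w}\|_M^2 + 2 \langle \eta \mathbf{g}, \mathbf{w} \rangle = -\eta^2 \|\mathbf{g}\|_{M^{-1}}^2,$$

we have

$$\|\mathbf{x} - \mathbf{w}\|_M^2 - \|\mathbf{y} - \mathbf{w}\|_M^2 \ge 2 \langle \eta \mathbf{g}, \mathbf{x} - \mathbf{w} \rangle - \eta^2 \|\mathbf{g}\|_{M^{-1}}^2.$$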

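4.4 Proof of Lemma 4

For each $\mathbf{w} \in \mathbb{R}^d$, we introduce a discrete probability distribution $p_{\mathbf{w}}$ over $\{\pm 1\}$ such that

$$p_{\mathbf{w}}(i) = \frac{1}{1 + \exp(-i \mathbf{x}_t^\top \mathbf{w})}, \quad i \in \{\pm 1\}.$$

Then, it is easy to verify that

$$\bar{f}_t(\mathbf{w}) = -\sum_{i \in \{\pm 1\}} p_{\mathbf{w}_*}(i) \log p_{\mathbf{w}}(i).$$

As a result

$$\begin{aligned}
\bar{f}_t(\mathbf{w}) - \bar{f}_t(\mathbf{w}_*) &= \sum_{i \in \{\pm 1\}} p_{\mathbf{w}_*}(i) \log p_{\mathbf{w}_*}(i) - \sum_{i \in \{\pm 1\}} p_{\mathbf{w}_*}(i) \log p_{\mathbf{w}}(i) \\
&= \sum_{i \in \{\pm 1\}} p_{\mathbf{w}_*}(i) \log \frac{p_{\mathbf{w}_*}(i)}{p_{\mathbf{w}}(i)} = D_{KL}(p_{\mathbf{w}_*} \| p_{\mathbf{w}}) \ge 0
\end{aligned}$$

where $D_{KL}(\cdot \| \cdot)$ is the Kullback-Leibler divergence between two distributions (Cover and Thomas,
2006).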
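Lemma 4 states that the expected logistic loss is minimized at $\mathbf{w}_*$. The following Python sketch evaluates the expected loss (16) in closed form and compares random candidates against $\mathbf{w}_*$; the dimension, the randomly drawn $\mathbf{w}_*$, and the function name are hypothetical.

```python
import numpy as np

def expected_loss(w, x, w_star):
    """Expectation of log(1 + exp(-y x^T w)) over y drawn from the logit model with w_star."""
    p_pos = 1.0 / (1.0 + np.exp(-(x @ w_star)))      # Pr[y = +1 | x]
    z = x @ w
    return p_pos * np.log1p(np.exp(-z)) + (1.0 - p_pos) * np.log1p(np.exp(z))

rng = np.random.default_rng(0)
d = 3
w_star = rng.normal(size=d)                          # hypothetical true parameter
x = rng.normal(size=d)
x /= np.linalg.norm(x)                               # action on the unit sphere
gaps = [expected_loss(rng.normal(size=d), x, w_star) - expected_loss(w_star, x, w_star)
        for _ in range(1000)]
```

Each gap equals the Kullback-Leibler divergence between the feedback distributions induced by $\mathbf{w}_*$ and by the candidate, so it cannot be negative.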

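4.5 Proof of Lemma 5

We need Bernstein’s inequality for martingales (Cesa-Bianchi and Lugosi, 2006), which
is provided in Appendix D. From our definition of $\bar{f}_i(\cdot)$ in (16), it is clear that

$$b_i = [\nabla \bar{f}_i(\mathbf{w}_i) - \nabla f_i(\mathbf{w}_i)]^\top (\mathbf{w}_i - \mathbf{w}_*)$$

is a martingale difference sequence. Furthermore,

$$|b_i| \le \left| [\nabla \bar{f}_i(\mathbf{w}_i)]^\top (\mathbf{w}_i - \mathbf{w}_*) \right| + \left| [\nabla f_i(\mathbf{w}_i)]^\top (\mathbf{w}_i - \mathbf{w}_*) \right| \le 2 |\mathbf{x}_i^\top (\mathbf{w}_i - \mathbf{w}_*)| \le 2 \|\mathbf{w}_i - \mathbf{w}_*\|_2 \le 4R.$$

Define the martingale $B_t = \sum_{i=1}^{t} b_i$. Define the conditional variance $\Sigma_t^2$ as

$$\Sigma_t^2 = \sum_{i=1}^{t} \mathbb{E}_{y_i}\left[ \left( [\nabla \bar{f}_i(\mathbf{w}_i) - \nabla f_i(\mathbf{w}_i)]^\top (\mathbf{w}_i - \mathbf{w}_*) \right)^2 \right] \le \sum_{i=1}^{t} \mathbb{E}_{y_i}\left[ \left( \nabla f_i(\mathbf{w}_i)^\top (\mathbf{w}_i - \mathbf{w}_*) \right)^2 \right] \le \underbrace{\sum_{i=1}^{t} \left( \mathbf{x}_i^\top (\mathbf{w}_i - \mathbf{w}_*) \right)^2}_{:= A_t}$$

where the first inequality is due to the fact that $\mathbb{E}[(\xi - \mathbb{E}[\xi])^2] \le \mathbb{E}[\xi^2]$ for any random
variable $\xi$.

In the following, we consider two different scenarios, i.e., $A_t \le \frac{4R^2}{t}$ and $A_t > \frac{4R^2}{t}$.

$A_t \le \frac{4R^2}{t}$: In this case, we have

$$B_t \le \sum_{i=1}^{t} |b_i| \le 2 \sum_{i=1}^{t} |\mathbf{x}_i^\top (\mathbf{w}_i - \mathbf{w}_*)| \le 2 \sqrt{t \sum_{i=1}^{t} \left( \mathbf{x}_i^\top (\mathbf{w}_i - \mathbf{w}_*) \right)^2} \le 4R. \tag{24}$$

$A_t > \frac{4R^2}{t}$: Since $A_t$ in the upper bound for $\Sigma_t^2$ is a random variable, we cannot apply
Bernstein’s inequality directly. To address this issue, we make use of the peeling process
(Bartlett et al., 2005). Note that we have both a lower bound and an upper bound for $A_t$,
i.e., $4R^2/t < A_t \le 4R^2 t$. Then,

$$\begin{aligned}
&\Pr\left[ B_t \ge 2\sqrt{A_t \tau_t} + \frac{8}{3} R \tau_t \right] \\
&= \Pr\left[ B_t \ge 2\sqrt{A_t \tau_t} + \frac{8}{3} R \tau_t,\ \frac{4R^2}{t} < A_t \le 4R^2 t \right] \\
&= \Pr\left[ B_t \ge 2\sqrt{A_t \tau_t} + \frac{8}{3} R \tau_t,\ \Sigma_t^2 \le A_t,\ \frac{4R^2}{t} < A_t \le 4R^2 t \right] \\
&\le \sum_{i=1}^{m} \Pr\left[ B_t \ge 2\sqrt{A_t \tau_t} + \frac{8}{3} R \tau_t,\ \Sigma_t^2 \le A_t,\ \frac{4R^2 2^{i-1}}{t} < A_t \le \frac{4R^2 2^{i}}{t} \right] \\
&\le \sum_{i=1}^{m} \Pr\left[ B_t \ge \sqrt{2 \cdot \frac{4R^2 2^{i}}{t} \tau_t} + \frac{8}{3} R \tau_t,\ \Sigma_t^2 \le \frac{4R^2 2^{i}}{t} \right] \\
&\le m e^{-\tau_t}
\end{aligned}$$

where $m = \lceil 2 \log_2 t \rceil$, and the last step follows from Bernstein’s inequality for martingales.
By setting $\tau_t = \log \frac{2mt^2}{\delta}$, with a probability at least $1 - \delta/[2t^2]$, we have

$$B_t \le 2\sqrt{A_t \tau_t} + \frac{8}{3} R \tau_t. \tag{25}$$

Combining (24) and (25), with a probability at least $1 - \delta/[2t^2]$, we have

$$B_t \le 4R + 2\sqrt{A_t \tau_t} + \frac{8}{3} R \tau_t.$$

We complete the proof by taking the union bound over $t > 0$, and using the well-known
result

$$\sum_{t=1}^{\infty} \frac{1}{t^2} = \frac{\pi^2}{6} \le 2.$$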

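4.6 Proof of Theorem 3

The proof is standard and can be found in Dani et al. (2008a) and Abbasi-Yadkori et al.
(2011). We include it for the sake of completeness.
Let $\mathbf{x}_* = \operatorname{argmax}_{\mathbf{x} \in \mathcal{D}} \mathbf{x}^\top \mathbf{w}_*$. Recall that in each round, we have

$$(\mathbf{x}_t, \tilde{\mathbf{w}}_t) = \operatorname*{argmax}_{\mathbf{x} \in \mathcal{D},\, \mathbf{w} \in C_t} \mathbf{x}^\top \mathbf{w}.$$

We decompose the instantaneous regret at round $t$ as follows

$$\begin{aligned}
\mathbf{x}_*^\top \mathbf{w}_* - \mathbf{x}_t^\top \mathbf{w}_* &\le \mathbf{x}_t^\top \tilde{\mathbf{w}}_t - \mathbf{x}_t^\top \mathbf{w}_* = \mathbf{x}_t^\top (\tilde{\mathbf{w}}_t - \mathbf{w}_t) + \mathbf{x}_t^\top (\mathbf{w}_t - \mathbf{w}_*) \\
&\le \left( \|\tilde{\mathbf{w}}_t - \mathbf{w}_t\|_{Z_t} + \|\mathbf{w}_t - \mathbf{w}_*\|_{Z_t} \right) \|\mathbf{x}_t\|_{Z_t^{-1}} \le 2\sqrt{\gamma_t} \|\mathbf{x}_t\|_{Z_t^{-1}}.
\end{aligned}$$

On the other hand, we always have

$$\mathbf{x}_*^\top \mathbf{w}_* - \mathbf{x}_t^\top \mathbf{w}_* \le \|\mathbf{x}_* - \mathbf{x}_t\|_2 \|\mathbf{w}_*\|_2 \le 2R.$$

From the definition in (12), we have $\sqrt{\gamma_t} \ge R$. Thus, the total regret can be upper
bounded by

$$T \max_{\mathbf{x} \in \mathcal{D}} \mathbf{x}^\top \mathbf{w}_* - \sum_{t=1}^{T} \mathbf{x}_t^\top \mathbf{w}_* \le 2 \sum_{t=1}^{T} \min\left( \sqrt{\gamma_t} \|\mathbf{x}_t\|_{Z_t^{-1}}, R \right).$$

To proceed, we need the following results from Lemma 11 in Abbasi-Yadkori et al. (2011).

$$\sum_{t=1}^{T} \min\left( \frac{\eta\beta}{2} \|\mathbf{x}_t\|_{Z_t^{-1}}^2, 1 \right) \le 2 \sum_{t=1}^{T} \log\left( 1 + \frac{\eta\beta}{2} \|\mathbf{x}_t\|_{Z_t^{-1}}^2 \right)$$

and

$$\begin{aligned}
\det(Z_{T+1}) &= \det\left( Z_T + \frac{\eta\beta}{2} \mathbf{x}_T \mathbf{x}_T^\top \right) = \det(Z_T) \det\left( I + \frac{\eta\beta}{2} Z_T^{-1/2} \mathbf{x}_T \mathbf{x}_T^\top Z_T^{-1/2} \right) \\
&= \det(Z_T) \left( 1 + \frac{\eta\beta}{2} \|\mathbf{x}_T\|_{Z_T^{-1}}^2 \right) = \det(Z_1) \prod_{t=1}^{T} \left( 1 + \frac{\eta\beta}{2} \|\mathbf{x}_t\|_{Z_t^{-1}}^2 \right).
\end{aligned}$$

Combining the above inequalities, we have

$$T \max_{\mathbf{x} \in \mathcal{D}} \mathbf{x}^\top \mathbf{w}_* - \sum_{t=1}^{T} \mathbf{x}_t^\top \mathbf{w}_* \le 4 \sqrt{\frac{\gamma_T T}{\eta\beta} \log \frac{\det(Z_{T+1})}{\det(Z_1)}}.$$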
5. Conclusions


In this paper, we consider the problem of online linear optimization under one-bit feedback.
Under the assumption that the binary feedback is generated from the logit model, we develop
a variant of the online Newton step to approximate the unknown vector, and discuss how
to construct the confidence region theoretically. Given the confidence region, we choose the
action that produces maximal reward in each round. Theoretical analysis reveals that our
algorithm achieves a regret bound of $\tilde{O}(d\sqrt{T})$.

The current algorithm assumes that the one-bit feedback is generated from a logit model.
In contrast, a much broader class of observation models is allowed in one-bit compressive
sensing (Plan and Vershynin, 2013), as long as there is a positive correlation between the
one-bit output and the real-valued measurement. In the future, we will investigate how
to extend our algorithm to other observation models. Another direction is to consider the
adversarial setting where the unknown vector $\mathbf{w}_*$ may change from time to time.


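Appendix A. Proof of Lemma 1

Let $\mu(x) = \frac{\exp(x)}{1 + \exp(x)}$. It is easy to verify that $\forall x \in [-R, R]$,

$$\frac{1}{2(1 + \exp(R))} \le \mu'(x) = \frac{\exp(x)}{(1 + \exp(x))^2} \le \frac{1}{4}. \tag{26}$$

Note that for any $-R \le a \le b \le R$, we have

$$\mu(b) = \mu(a) + \int_a^b \mu'(x)\, dx. \tag{27}$$

Combining (26) with (27), we have

$$\frac{1}{2(1 + \exp(R))} (b - a) \le \mu(b) - \mu(a) \le \frac{1}{4} (b - a).$$

Let

$$\mathbf{x}_* = \operatorname*{argmax}_{\mathbf{x} \in \mathcal{D}} \mathbf{x}^\top \mathbf{w}_* = \operatorname*{argmax}_{\mathbf{x} \in \mathcal{D}} \frac{\exp(\mathbf{x}^\top \mathbf{w}_*)}{1 + \exp(\mathbf{x}^\top \mathbf{w}_*)}.$$

Since $-R \le \mathbf{x}_t^\top \mathbf{w}_* \le \mathbf{x}_*^\top \mathbf{w}_* \le R$, we have

$$\frac{1}{2(1 + \exp(R))} \left( \mathbf{x}_*^\top \mathbf{w}_* - \mathbf{x}_t^\top \mathbf{w}_* \right) \le \frac{\exp(\mathbf{x}_*^\top \mathbf{w}_*)}{1 + \exp(\mathbf{x}_*^\top \mathbf{w}_*)} - \frac{\exp(\mathbf{x}_t^\top \mathbf{w}_*)}{1 + \exp(\mathbf{x}_t^\top \mathbf{w}_*)} \le \frac{1}{4} \left( \mathbf{x}_*^\top \mathbf{w}_* - \mathbf{x}_t^\top \mathbf{w}_* \right)$$

which implies (7).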

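Appendix B. Proof of Lemma 6

We have

$$\|\mathbf{x}_i\|_{Z_{i+1}^{-1}}^2 = \frac{2}{\eta\beta} \left\langle Z_{i+1}^{-1}, Z_{i+1} - Z_i \right\rangle \le \frac{2}{\eta\beta} \log \frac{\det(Z_{i+1})}{\det(Z_i)},$$

where the inequality follows from Lemma 12 in Hazan et al. (2007). Thus, we have

$$\sum_{i=1}^{t} \|\mathbf{x}_i\|_{Z_{i+1}^{-1}}^2 \le \frac{2}{\eta\beta} \sum_{i=1}^{t} \log \frac{\det(Z_{i+1})}{\det(Z_i)} = \frac{2}{\eta\beta} \log \frac{\det(Z_{t+1})}{\det(Z_1)}.$$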
Appendix C. Proof of Corollary 2

Recall that

$$Z_{t+1} = Z_1 + \frac{\eta\beta}{2} \sum_{i=1}^{t} \mathbf{x}_i \mathbf{x}_i^\top$$

and $\|\mathbf{x}_t\|_2 \le 1$ for all $t > 0$. From Lemma 10 of Abbasi-Yadkori et al. (2011), we have

$$\det(Z_{t+1}) \le \left( \lambda + \frac{\eta\beta t}{2d} \right)^d.$$

Since $\det(Z_1) = \lambda^d$, we have

$$\log \frac{\det(Z_{t+1})}{\det(Z_1)} \le d \log\left( 1 + \frac{\eta\beta t}{2\lambda d} \right).$$


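Appendix D. Bernstein’s Inequality for Martingales

Theorem 4 Let $X_1, \ldots, X_n$ be a bounded martingale difference sequence with respect to the
filtration $\mathcal{F} = (\mathcal{F}_i)_{1 \le i \le n}$ and with $|X_i| \le K$. Let

$$S_i = \sum_{j=1}^{i} X_j$$

be the associated martingale. Denote the sum of the conditional variances by

$$\Sigma_n^2 = \sum_{t=1}^{n} \mathbb{E}\left[ X_t^2 \mid \mathcal{F}_{t-1} \right].$$

Then for all constants $t, v > 0$,

$$\Pr\left[ \max_{i=1,\ldots,n} S_i > t \text{ and } \Sigma_n^2 \le v \right] \le \exp\left( -\frac{t^2}{2(v + Kt/3)} \right),$$

and therefore,

$$\Pr\left[ \max_{i=1,\ldots,n} S_i > \sqrt{2vt} + \frac{\sqrt{2}}{3} K t \text{ and } \Sigma_n^2 \le v \right] \le e^{-t}.$$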